feat: provide Request instances in skipped request callbacks by lorenz-lb · Pull Request #1927 · apify/crawlee-python

lorenz-lb · 2026-05-31T14:01:58Z

Description

This PR changes the skipped request handler to receive a Request object instead of only the URL string.

Skipped URLs are now converted into Request objects before the callback is invoked. This ensures that request metadata remains available for skipped requests, including request.user_data and other request attributes.

Issues

N/A

Testing

Added/updated unit tests for skipped request handlers.
Verified that skipped request handlers receive the original Request object.
Ran uv run poe check-code.

Checklist

CI passed

Future improvements

While working on this change, I noticed that many parts of the internal processing pipeline operate on str | Request unions.

A possible future improvement could be to standardize on Request objects internally and only accept str | Request at the public API boundary. URLs could then be wrapped into Request instances immediately, allowing the rest of the codebase to operate exclusively on Request objects.

This could simplify typing, reduce repeated union handling, and make request metadata consistently available throughout the pipeline. However, such a change would require broader refactoring and additional testing, so it is outside the scope of this PR.

vdusek · 2026-06-05T08:25:10Z

 ErrorHandler = Callable[[TCrawlingContext, Exception], Awaitable[Request | None]]
 FailedRequestHandler = Callable[[TCrawlingContext, Exception], Awaitable[None]]
-SkippedRequestCallback = Callable[[str, SkippedReason], Awaitable[None]]
+SkippedRequestCallback = Callable[[Request, SkippedReason], Awaitable[None]]


This is a breaking change to a public API. Existing callbacks like async def cb(url: str, reason) will now receive a Request; string usage like url.startswith(...) will crash. It should work with both str and Request.

I don't think you can start sending Request instances into existing callbacks that expect str without either breaking BC or adding runtime type inspection. Do we really want to go that way?

You're right. Well, this may be v2.0 material.
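For reference, the "runtime type inspection" route mentioned above could look roughly like the sketch below. It is not part of this PR or of crawlee: `Request` is a minimal stand-in dataclass, and `wants_request` and `call_skipped_callback` are hypothetical names. The idea is to pass a Request only to callbacks whose first parameter is explicitly annotated as Request, and to keep passing the URL string to everything else.

```python
import asyncio
import inspect
from dataclasses import dataclass, field
from typing import Any, get_type_hints


@dataclass
class Request:
    """Stand-in for crawlee's Request (assumption: only these fields matter here)."""

    url: str
    user_data: dict[str, Any] = field(default_factory=dict)


def wants_request(callback: Any) -> bool:
    """Return True only if the callback's first parameter is annotated as Request."""
    params = list(inspect.signature(callback).parameters.values())
    if not params:
        return False
    try:
        hints = get_type_hints(callback)
    except Exception:
        # Unresolvable annotations: fall back to the legacy str behaviour.
        return False
    return hints.get(params[0].name) is Request


async def call_skipped_callback(callback: Any, request: Request, reason: str) -> None:
    """Hypothetical dispatcher: legacy callbacks get the URL, opted-in ones get the Request."""
    first_arg = request if wants_request(callback) else request.url
    await callback(first_arg, reason)
```

Unannotated callbacks are treated as legacy here, which keeps old code working but also shows why this approach is fragile: the behaviour depends on annotations that users may omit or that may fail to resolve.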

vdusek · 2026-06-05T08:25:10Z

        if self._on_skipped_request:
            try:
-                await self._on_skipped_request(url, reason)
+                await self._on_skipped_request(request, reason)


This now forwards request straight to the callback, but add_requests() (just below, ~line 841) wasn't updated — it still builds skipped from the original Sequence[str | Request] and passes a possibly-str item here:

    for request in requests:
        check_url = request.url if isinstance(request, Request) else request
        if await self._is_allowed_based_on_robots_txt_file(check_url):
            allowed_requests.append(request)
        else:
            skipped.append(request)  # <- can be a plain str

So with respect_robots_txt_file=True, await crawler.add_requests(['https://disallowed/...']) on a robots-disallowed URL delivers a str to the callback, and request.url (as in the docs example) raises AttributeError → UserDefinedErrorHandlerError.

Suggest normalizing str → Request once at the choke point (here, or in add_requests) instead of only in the two extract_links impls. That also makes the isinstance(request, Request) guard a few lines up dead code (it now contradicts the request: Request annotation). This path also has no test coverage.
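A minimal sketch of the suggested normalization, under stated assumptions: `Request` is a stand-in dataclass rather than crawlee's model, `to_request` and `split_by_robots` are hypothetical helper names, and the robots check is a plain synchronous predicate instead of the async `_is_allowed_based_on_robots_txt_file`. The point is only that every item is converted once, so both the allowed and the skipped lists hold Request objects.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Request:
    """Stand-in for crawlee's Request (assumption for this sketch)."""

    url: str

    @classmethod
    def from_url(cls, url: str) -> "Request":
        return cls(url=url)


def to_request(item: "str | Request") -> Request:
    """Wrap plain URL strings; pass existing Request objects through unchanged."""
    return item if isinstance(item, Request) else Request.from_url(item)


def split_by_robots(
    items: "Sequence[str | Request]",
    is_allowed: Callable[[str], bool],
) -> "tuple[list[Request], list[Request]]":
    """Normalize first, then partition, so no str can reach the skip callback."""
    allowed: list[Request] = []
    skipped: list[Request] = []
    for item in items:
        request = to_request(item)
        if is_allowed(request.url):
            allowed.append(request)
        else:
            skipped.append(request)
    return allowed, skipped
```

With this shape, the `isinstance(request, Request)` guard further down becomes unnecessary, because the union type never leaves the entry point.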

vdusek · 2026-06-05T08:25:10Z

            requests = list[Request]()
+            skipped = list[Request]()
+
+            def create_request(request_options: RequestOptions) -> Request | None:


create_request is now duplicated verbatim (including the multi-line debug message) with PlaywrightCrawler.extract_links (_playwright_crawler.py:464). Both crawlers extend BasicCrawler, so this looks like a good candidate for a single shared helper in crawlee/_utils (taking a logger) to keep the two copies from drifting.
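One possible shape for such a shared helper is sketched below. Everything here is an assumption for illustration: `create_request_or_none` is a hypothetical name, `Request` is a stand-in whose `from_url` rejects non-http(s) schemes with a `ValueError`, and the real code may catch a different exception type. The logger is passed in so both crawlers can reuse the same function with their own `context.log`.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional
from urllib.parse import urlparse


@dataclass
class Request:
    """Stand-in for crawlee's Request (assumption for this sketch)."""

    url: str

    @classmethod
    def from_url(cls, url: str, **_: Any) -> "Request":
        # Stand-in validation: accept only http/https URLs.
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"unsupported URL scheme in {url!r}")
        return cls(url=url)


def create_request_or_none(
    request_options: dict, logger: logging.Logger
) -> Optional[Request]:
    """Build a Request from options, or log at debug level and return None."""
    try:
        return Request.from_url(**request_options)
    except ValueError as exc:
        logger.debug(
            'Skipping URL "%s" due to invalid format: %s',
            request_options["url"],
            exc,
        )
        return None
```

Keeping the log message in one place also means any rewording only has to happen once.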

vdusek · 2026-06-05T08:25:10Z

+                    context.log.debug(
+                        f'Skipping URL "{request_options["url"]}" due to invalid format: {exc}. '
+                        'This may be caused by a malformed URL or unsupported URL scheme. '
+                        'Please ensure the URL is correct and retry.'


Now that this helper is also used for robots-skipped, auto-discovered links (not just user enqueues), the "Please ensure the URL is correct and retry." wording is misleading — the operator never submitted this URL and there's nothing to retry. Consider a neutral message, or a separate one for the skip path.

vdusek · 2026-06-05T08:25:10Z

                else context.request.loaded_url or context.request.url
            )
            links_iterator = to_absolute_url_iterator(base_url, links_iterator, logger=context.log)
+            skipped_iterator = iter([])


Minor: skipped_iterator = iter([]) followed by a conditional reassignment inside if robots_txt_file: reads a little awkwardly (and iter([]) infers Iterator[Never]). The previous if/else, or guarding the skipped-building loop under the same if robots_txt_file:, is clearer.

vdusek · 2026-06-05T08:25:10Z

+                request = create_request(request_options)
+
+                if request is not None:
+                    skipped.append(request)


Two things about this skipped-building loop:

Behavior change / silent drop: previously every robots-disallowed URL reached the callback as a raw string; now they go through Request.from_url, which enforces http/https (validate_http_url). A disallowed-but-non-http(s) or malformed-but-absolute URL (one that passes to_absolute_url_iterator but fails from_url) is now silently dropped and the skip callback never fires for it. Intended?

Consistency: these skipped requests are built directly and don't pass through transform_request_function, while the enqueued ones do. So a skipped Request carries the base label/user_data/enqueue_strategy but not the user's per-request transform — inconsistent with enqueued requests from the same links.

vdusek · 2026-06-05T08:25:10Z

            requests = list[Request]()
+            skipped = list[Request]()
+
+            def create_request(request_options: RequestOptions) -> Request | None:


Same as the HTTP crawler: this create_request is a verbatim duplicate (candidate for a shared helper), and the skipped-building loop below has the same silent-drop and transform_request_function-not-applied behavior. See the comments on _abstract_http_crawler.py.

vdusek · 2026-06-05T08:25:10Z

+
+    requests = [call.args[0] for call in skip.call_args_list]
+
+    all(isinstance(request, Request) for request in requests)
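
Note that the last added line in this snippet is a bare expression: the result of `all(...)` is computed and discarded, so the test passes even if a callback received a str. A minimal sketch of an asserting version follows, assuming `skip` is an `AsyncMock`-style mock as the use of `call_args_list` suggests; `Request` is a stand-in dataclass and `check_all_requests` is a hypothetical helper name.

```python
import asyncio
from dataclasses import dataclass
from unittest.mock import AsyncMock


@dataclass
class Request:
    """Stand-in for crawlee's Request (assumption for this sketch)."""

    url: str


def check_all_requests(skip: AsyncMock) -> None:
    """Assert that the mock was called and every first argument is a Request."""
    requests = [call.args[0] for call in skip.call_args_list]
    # all([]) is True, so guard against the callback never being invoked.
    assert requests, "skip callback was never invoked"
    assert all(isinstance(request, Request) for request in requests)
```

The extra non-empty check matters because `all()` over an empty list is `True`, which would otherwise hide a callback that never fired.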


lorenz-lb · 2026-06-09T08:29:27Z

Thanks for the review! Sorry for some obvious mistakes; I only started using crawlee recently, so that's clearly a skill issue on my side. I'll incorporate your feedback very soon!

feat: provide Request instances in skipped request callbacks

32b9755

janbuchar self-requested a review June 3, 2026 19:07

vdusek requested changes Jun 9, 2026

4 participants
